Sequential design with early stopping (restricted action set) - risk based
Published
May 8, 2025
Modified
May 8, 2025
Load simulation results
# Each input file corresponds to the results from a single simulation
# scenario/configuration.
# Load all the files into a single list.

# files of interest
sim_lab <- "sim05-13"
flist <- list.files(paste0("data/", sim_lab), pattern = "sim05")
toks <- list()
l <- list()
for (i in seq_along(flist)) {
  # read the stored simulation results for this scenario
  l[[i]] <- qs::qread(file.path(paste0("data/", sim_lab), flist[i]))
  # split the file name into tokens on "-" and "." (tstrsplit is from data.table)
  toks[[i]] <- unlist(data.table::tstrsplit(flist[i], "[-.]"))
}
Introduction
In the present design, the model is used to compute unit-level risk (probability), and decisions are assessed on the risk difference for the intervention comparisons by domain. The log-odds scale has limited interpretability for clinical users and is inconsistent in absolute terms across strata with varying baseline rates. It is also useful to explore what an absolute perspective on effectiveness translates to in terms of operating characteristics. Simulation parameters continue to be expressed as log-odds-ratios, which keeps the simulation process simpler. Results are presented on both the odds and risk scales. The aim is to give better intuition and transparency on the magnitude of the effects we are contemplating and to determine whether these are reasonable assumptions.
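As a rough illustration of how posterior draws on the log-odds scale can be turned into a risk difference, the sketch below averages unit-level risks under each arm for every draw. The two-parameter model, the draw values and the `unit_offset` vector are all invented for illustration and are not the trial model.

```r
# Hypothetical posterior draws for a toy logistic model (assumed values)
set.seed(1)
n_draw <- 2000
draws <- cbind(
  b0    = rnorm(n_draw, -0.5, 0.15),      # reference log-odds
  b_trt = rnorm(n_draw, log(1.75), 0.2)   # treatment log-odds-ratio
)

# Invented unit-level shifts on the log-odds scale (e.g. silo effects)
unit_offset <- c(-0.4, 0, 0.3, 0.6)

# For each draw: unit-level risk under control and under treatment,
# then the mean of the unit-level differences
risk_diff <- apply(draws, 1, function(b) {
  p_ctl <- plogis(b["b0"] + unit_offset)
  p_trt <- plogis(b["b0"] + b["b_trt"] + unit_offset)
  mean(p_trt - p_ctl)
})

# Posterior means on the odds-ratio and risk-difference scales
c(or_mean = mean(exp(draws[, "b_trt"])), rd_mean = mean(risk_diff))
```

The same log-odds-ratio yields different unit-level risk differences depending on the offset, which is why the averaging step matters.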
We have a single, large, multivariable logistic regression model. One of the sets of parameters accounts for clinician preference for revision type in order to achieve conditional exchangeability across groups. At each interim, we assess the posterior and if a decision threshold is met, we act. For example, if a superiority decision is reached in one of the domains for which this decision type is relevant, then we consider that domain dealt with and all subsequent participants are assigned to receive the superior intervention. We can (and presently do) continue to update the posterior inference for the comparison that has stopped in subsequent interim analyses until we get to the point where all questions have been answered in all domains, at which point the trial will stop.
The priors were as follows:
Reference log-odds of response: logistic distribution, mean 0 and scale 0.47
Silo effects: normal distribution, mean 0 and scale 1
Joint effects: normal distribution, mean 0 and scale 1
Preference effects: normal distribution, mean 0 and scale 1
Treatment effects: normal distribution, mean 0 and scale 1
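To get a feel for what these priors imply on the probability scale, one option is to simulate from them. The sketch below assumes that "scale" for the normal priors means the standard deviation, and it ignores the silo, joint and preference terms for simplicity.

```r
set.seed(2)
n <- 1e4
b_ref <- rlogis(n, location = 0, scale = 0.47)  # reference log-odds of response
b_trt <- rnorm(n, mean = 0, sd = 1)             # treatment effect (log-odds-ratio)

# Implied prior on the reference response probability
p_ref <- plogis(b_ref)
quantile(p_ref, c(0.025, 0.5, 0.975))

# Implied prior on the risk difference, treated vs reference
rd <- plogis(b_ref + b_trt) - p_ref
quantile(rd, c(0.025, 0.5, 0.975))
```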
Superiority and non-inferiority are applicable to some domains and not others. However, we define reference and threshold values for all domains, just in case.
For the superiority decision, a reference value of 0 was used and the probability thresholds were:
0.94 for surgical domain
0.98 for antibiotic duration domain
0.98 for extended prophylaxis domain
0.995 for antibiotic choice domain
For the futility decision (in relation to superiority) a reference value of 0.05 was used and the probability thresholds were:
0.3 for surgical domain
0.25 for antibiotic duration domain
0.25 for extended prophylaxis domain
0.25 for antibiotic choice domain
The above means, for example, that if the probability that the risk difference is greater than 0.05 is less than 0.3 in the surgical domain comparison, then we say the superiority goal is futile.
For the non-inferiority (NI) decision, a reference value of -0.05 was used and the probability thresholds were:
0.98 for surgical domain
0.925 for antibiotic duration domain
0.98 for extended prophylaxis domain
0.98 for antibiotic choice domain
The above means, for example, that if the probability that the risk difference is greater than -0.05 exceeds 0.925 in the antibiotic duration domain, then we say the intervention is non-inferior.
The futility decision (in relation to non-inferiority) has a reference value of 0 and the probability thresholds were:
0.25 for surgical domain
0.1 for antibiotic duration domain
0.25 for extended prophylaxis domain
0.25 for antibiotic choice domain
This means, for example, that if the probability that the risk difference is greater than 0 is less than 0.1 in the antibiotic duration domain, then we say the non-inferiority goal is futile.
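The four rules can be sketched as a single function applied to posterior draws of the risk difference. The reference values and thresholds are those listed above. The function name, the domain labels and the example draws are hypothetical, and in practice only the decisions relevant to a given domain would be acted on.

```r
# Probability thresholds by domain, as listed above
thresholds <- data.frame(
  domain  = c("surgical", "ab_duration", "ext_proph", "ab_choice"),
  sup     = c(0.94, 0.98, 0.98, 0.995),
  sup_fut = c(0.30, 0.25, 0.25, 0.25),
  ni      = c(0.98, 0.925, 0.98, 0.98),
  ni_fut  = c(0.25, 0.10, 0.25, 0.25)
)
# Reference values on the risk difference scale
ref <- c(sup = 0, sup_fut = 0.05, ni = -0.05, ni_fut = 0)

# rd: posterior draws of the risk difference for one domain
assess_domain <- function(rd, domain) {
  th <- thresholds[thresholds$domain == domain, ]
  c(
    sup     = mean(rd > ref[["sup"]])     > th$sup,      # superiority
    sup_fut = mean(rd > ref[["sup_fut"]]) < th$sup_fut,  # superiority futile
    ni      = mean(rd > ref[["ni"]])      > th$ni,       # non-inferiority
    ni_fut  = mean(rd > ref[["ni_fut"]])  < th$ni_fut    # NI futile
  )
}

set.seed(3)
rd_draws <- rnorm(4000, mean = 0.01, sd = 0.03)  # invented posterior draws
res <- assess_domain(rd_draws, "ab_duration")
res
```

With these invented draws, the probability of exceeding -0.05 is roughly 0.98, so non-inferiority is flagged, while the probability of exceeding 0.05 is under 0.25, so superiority is flagged as futile.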
Figure 1 attempts to put the superiority rules into pictures based on possible scenarios for assessment of the posterior risk difference for an arbitrary domain where superiority is being assessed. The approach, reference values and thresholds apply to all domains where superiority is assessed.
Figure 1: Visualisation of decision rule scenarios for superiority
Analogously, Figure 2 puts the non-inferiority rule into pictures based on possible scenarios for assessment of the posterior risk difference for an arbitrary domain where non-inferiority is being assessed. The approach, reference values and thresholds apply to all domains.
Figure 2: Visualisation of decision rule scenarios for non-inferiority
For this set of simulations, the number of simulated trials per scenario was 1000 and the simulation label is sim05-13.
Simulation results
Table 1 shows the cumulative probability of a superiority decision across each of the scenarios simulated (the same information is shown in Figure 3). Operating characteristics are shown only for the relevant domains and the futility of a superiority decision is included in parentheses.
Notes (last edit 2025-04-29):
The power for the ‘average’ surgical revision effect reaches about 86% when both one-stage and two-stage revision have a moderate effect, i.e. when both of these procedures increase the log-odds of response by the same \(\log(1.75)\). When only one of one-stage or two-stage shows an effect (the other having a zero effect), the weighted average effect of revision is lower and hence the power is lower. The decisions are based on the aggregated effect of both revision types, not the methods selected by the clinician for revision.
With the lower threshold value for superiority in the surgical domain, there is a commensurate increase in the type-i assertion probability.
AB extended prophylaxis receives entrants from acute, late and chronic silos and hence has more data to work with, leading to a higher overall cumulative probability of stopping. Ditto with the AB choice domain. So, in general, these have better overall power when effects are present, when contrasted with the surgical domain.
For the 90% power example, the effect (OR) sizes by domain are 1.87, 1.45, 1.73, 1.6 for surgical, ab duration, extended prophylaxis and choice respectively.
Figure 3: Cumulative probabilities for superiority assessments
Table 2 shows the cumulative probability of a non-inferiority decision with futility shown in parentheses (the same information is shown in Figure 5). The results are only shown for the domains for which non-inferiority is evaluated.
Notes:
The “Null effect in all domains” is actually a bit of a misnomer as the true null with regards to the NI decision would be at the NI margin, whereas the scenario refers to the setting where all effects are set to zero. Thus an inflation over the usual type-i assertion probability is to be expected.
In this set of results the cumulative probability of claiming NI by 2500 is 0.76 when the effect size is OR 1.75 (contrasting with 0.67 from the last set of simulation results).
In the null case (when the 6 week and 12 week response are effectively identical) there is only a ~10% cumulative probability of declaring NI. This is purely due to the thresholds we have selected. If equivalence is what we are actually thinking about, then we could potentially look towards evaluating that instead of non-inferiority (see Figure 4).
Figure 4: Decision options based on posterior
The higher power in the scenario where all domains have an effect arises because the surgical domain does not get shut down so you have more participants entering this AB duration domain.
| Scenario | Surgical | AB duration | Extended prophylaxis | AB choice |
|---|---:|---:|---:|---:|
| Large (OR 2.5) surgical revision effect (both one and two-stage) | 571 | 1,601 | 996 | 916 |
| Large (OR 2.5) surgical revision effect (one-stage only) | 1,404 | 1,620 | 1,074 | 918 |
| Large (OR 2.5) surgical revision effect (two-stage only) | 1,082 | 1,657 | 988 | 948 |
| Large (OR 2.5) antibiotic duration 6wk effect | 1,104 | 754 | 1,166 | 962 |
| Large (OR 2.5) antibiotic ext-proph 12wk effect | 1,178 | 1,750 | 712 | 955 |
| Large (OR 2.5) antibiotic choice rifampicin effect | 1,095 | 1,743 | 1,155 | 558 |
| Large (OR 2.5) effects in all domains | 546 | 739 | 762 | 602 |
| Moderate (OR 1.75) surgical revision effect (both one and two-stage) and antibiotic duration | 936 | 1,657 | 1,066 | 931 |
| Effects to achieve 90% power | 812 | 1,150 | 1,086 | 1,141 |
Table 3: Expected number of enrolments to hit any stopping rule (including reaching maximum sample size)
Figure 6 shows the number of participants entering into each of the randomised comparisons by domain and scenario.
The expected values are calculated by extracting the number of participants entering in each analysis and taking the cumulative sum of these, restricted to the relevant strata. For example, the domain 1 (surgical) expected values are based on the participants in the late acute silo that receive randomised surgical intervention. Similarly, the domain 2 (antibiotic duration) expected values are based on the participants across all silos that received one-stage revision.
If a decision was made for a single domain then subsequent enrolments would be assigned to the relevant arm. For example, if a superiority decision was made for domain 1 and the trial was ongoing then all subsequent participants are assigned to the superior intervention. Finally, if a decision was made for all research questions, then the trial would stop early and we use LOCF to propagate the sample size forward to subsequent analyses before computing the expected cumulative numbers by treatment arm.
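A minimal sketch of the LOCF step described above, using an invented matrix of cumulative enrolments (rows are simulated trials, columns are analyses). The helper `locf()` and all numbers are hypothetical and are not taken from the simulation code.

```r
# Cumulative number entering one randomised comparison at each analysis.
# NA means the trial had already stopped. All values are invented.
n_cum <- rbind(
  c(100, 210, 330, 450),
  c(110, 220,  NA,  NA),
  c( 90,  NA,  NA,  NA)
)

# Carry the last observed value forward within a vector
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of last non-missing value
  x[ifelse(idx == 0, NA, idx)]
}

# Fill each simulated trial forward, then average over trials per analysis
n_filled <- t(apply(n_cum, 1, locf))
expected_n <- colMeans(n_filled)
expected_n
```

Filling forward before averaging keeps every simulated trial in the denominator at every analysis, so the expected values are not computed only over the trials that happened to continue.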
Figure 7 shows the median value and 95% quantiles for the posterior means obtained from the simulations for each domain and scenario. The estimates are unconditional, in that they ignore the stopping rule and propagate estimates forward with LOCF so that we are working from a random sample (albeit with static imputation) rather than a dependent sample of simulations.
Notes
The AB duration domain suffers from very high variance in the distribution of posterior means that we anticipate observing. That is, when there is no true effect, we might still see effects as large as \(\pm 25\%\) on the absolute risk scale.
Figure 7: Median value of posterior means for odds-ratio treatment effects by domain and simulation scenario
Figure 8 shows the median value and 95% quantiles for the posterior means obtained from the simulations for each domain and scenario. The estimates are unconditional, in that they ignore the stopping rule and propagate estimates forward with LOCF.
Notes
This plot focuses on the risk difference, and suggests that the odds ratios translate to effects in the order of 10-20% on the risk scale, dependent on the silo, domain etc. This variation is to be expected due to the non-linearity in the inverse logit transform. For example, if the baseline log-odds of response was -0.5 and you are contemplating an odds ratio of 2.5 this translates to a risk difference of about 0.23 but if your baseline log-odds of response is 0.2 then, with the same OR, this translates to a risk difference of about 0.2.
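The arithmetic in the example above can be checked with a small helper. The function name `or_to_rd` is hypothetical; it simply applies the inverse logit to the baseline log-odds with and without the log odds ratio added.

```r
# Risk difference implied by an odds ratio at a given baseline log-odds
or_to_rd <- function(baseline_logodds, or) {
  plogis(baseline_logodds + log(or)) - plogis(baseline_logodds)
}

# Baseline log-odds of -0.5 and 0.2 with OR 2.5
rd <- or_to_rd(c(-0.5, 0.2), 2.5)
round(rd, 3)
```

This gives roughly 0.225 and 0.203, in line with the values quoted in the text.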
The fraction of uncertainty resolved is computed from \(Var(\beta_{post})\) and \(Var(\beta_{pri})\), the variances associated with the posterior and prior belief, respectively, for the relevant log-odds ratio for the treatment effects. This is basically just a way to compare the prior and posterior variance. When the posterior is based on negligible data, its variance will be similar to that of the prior and the fraction resolved will be very small. A low fraction resolved (e.g. less than 0.5) suggests that any decision was made with a substantial amount of uncertainty remaining (you didn’t move far from your prior belief), whereas values close to unity suggest that a lot of the uncertainty has been resolved.
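One definition consistent with this description is \(1 - Var(\beta_{post}) / Var(\beta_{pri})\). That form is an assumption made for this sketch, not a formula confirmed by the text, and the posterior draws below are invented.

```r
# Assumed definition: 1 - Var(posterior) / Var(prior)
fraction_resolved <- function(post_draws, prior_var) {
  1 - var(post_draws) / prior_var
}

set.seed(4)
prior_var <- 1^2  # treatment effect prior: normal, scale 1 (taken as the SD)

# Posterior barely narrower than the prior vs much narrower (invented draws)
fr_weak   <- fraction_resolved(rnorm(4000, 0.2, 0.9), prior_var)
fr_strong <- fraction_resolved(rnorm(4000, 0.5, 0.2), prior_var)
c(weak = fr_weak, strong = fr_strong)
```

Under this assumed form, a posterior standard deviation of 0.9 against a prior of 1 resolves only about a fifth of the prior variance, whereas a posterior standard deviation of 0.2 resolves most of it.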
What is obvious from the above plots is also obvious here: the decisions made in the AB duration domain are subject to a substantial amount of uncertainty.